pandas: Apply functions to values, rows, columns with map(), apply()


2024-07-17 06:23 | Source: compiled from the web

In pandas, you can use map(), apply(), and applymap() methods to apply functions to values (element-wise), rows, or columns in DataFrames and Series.

Contents
- Apply functions to values in Series: map(), apply()
- How to use map()
- How to use apply()
- Apply functions to values in DataFrame: map(), applymap()
- Apply functions to rows and columns in DataFrame: apply()
- Basic usage
- Specify rows or columns: axis
- Specify arguments for the function: Keyword arguments, args
- Pass as ndarray instead of Series: raw
- Apply functions to specific rows or columns
- Use methods of DataFrame and Series, and arithmetic Operators
- Use NumPy functions
- Speed comparison

As mentioned later, DataFrame and Series already include methods for common operations. Additionally, you can apply NumPy functions to DataFrame and Series. Using dedicated methods or NumPy functions is preferable to map() or apply() due to better performance.

The pandas and NumPy versions used in this article are as follows. Note that functionality may vary between versions.

    import pandas as pd
    import numpy as np

    print(pd.__version__)
    # 2.1.2

    print(np.__version__)
    # 1.26.1

source: pandas_numpy_function.py

Apply functions to values in Series: map(), apply()

To apply a function to each value in a Series (element-wise), use the map() or apply() methods.

pandas.Series.map — pandas 2.1.3 documentation
pandas.Series.apply — pandas 2.1.3 documentation

How to use map()

Passing a function to map() returns a new Series, with the function applied to each value. For example, apply the built-in hex() function to convert integers to hexadecimal strings.

Convert binary, octal, decimal, and hexadecimal in Python

    s = pd.Series([1, 10, 100])
    print(s)
    # 0      1
    # 1     10
    # 2    100
    # dtype: int64

    print(s.map(hex))
    # 0     0x1
    # 1     0xa
    # 2    0x64
    # dtype: object

source: pandas_series_map_apply.py

You can also apply functions defined with def or lambda expressions.

Define and call functions in Python (def, return)
Lambda expressions in Python

    def my_func(x):
        return x * 10

    print(s.map(my_func))
    # 0      10
    # 1     100
    # 2    1000
    # dtype: int64

    print(s.map(lambda x: x * 10))
    # 0      10
    # 1     100
    # 2    1000
    # dtype: int64

source: pandas_series_map_apply.py

The above example is for illustrative purposes; simple arithmetic operations can be directly performed on a Series.

    print(s * 10)
    # 0      10
    # 1     100
    # 2    1000
    # dtype: int64

source: pandas_series_map_apply.py

By default, missing values (NaN) are passed to the function, but if you set the second argument na_action to 'ignore', NaN will not be passed to the function and the result will remain as NaN.

Because the presence of NaN changes the data type (dtype) to a floating-point number (float), values are converted to integers (int) using int() before being passed to hex() in the following example.

    s_nan = pd.Series([1, float('nan'), 100])
    print(s_nan)
    # 0      1.0
    # 1      NaN
    # 2    100.0
    # dtype: float64

    # print(s_nan.map(lambda x: hex(int(x))))
    # ValueError: cannot convert float NaN to integer

    print(s_nan.map(lambda x: hex(int(x)), na_action='ignore'))
    # 0     0x1
    # 1     NaN
    # 2    0x64
    # dtype: object

source: pandas_series_map_apply.py

You can also pass a dictionary (dict) to map(). In this case, it replaces values. For more details, refer to the following article.
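As a minimal sketch of the dictionary form (the keys and replacement strings below are made up for illustration): each value is looked up as a key in the dictionary, and values that have no matching key become NaN.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'])

# Each value is looked up as a key in the dict.
# 'c' has no matching key, so the result is NaN at that position.
mapped = s.map({'a': 'apple', 'b': 'banana'})
print(mapped)
```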

pandas: Replace Series values with map()

How to use apply()

Similar to map(), the function specified as the first argument in apply() is applied to each value. The difference is that apply() allows you to specify arguments to be passed to the function.

With map(), you need to use a lambda expression or similar approach to pass arguments to the function. For example, specify the base argument in the int() function, which converts strings to integers.

    s = pd.Series(['11', 'AA', 'FF'])
    print(s)
    # 0    11
    # 1    AA
    # 2    FF
    # dtype: object

    # print(s.map(int, base=16))
    # TypeError: Series.map() got an unexpected keyword argument 'base'

    print(s.map(lambda x: int(x, 16)))
    # 0     17
    # 1    170
    # 2    255
    # dtype: int64

source: pandas_series_map_apply.py

With apply(), any specified keyword arguments are passed directly to the function. It is also possible to specify positional arguments using the args argument.

    print(s.apply(int, base=16))
    # 0     17
    # 1    170
    # 2    255
    # dtype: int64

    print(s.apply(int, args=(16,)))
    # 0     17
    # 1    170
    # 2    255
    # dtype: int64

source: pandas_series_map_apply.py

Note that even if there is only one positional argument, it must be specified as a tuple or list in the args argument. A comma is necessary at the end of a one-element tuple.

A tuple with one element requires a comma in Python
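The following short sketch shows why the trailing comma matters. Without it, the parentheses are only grouping, so the value is a plain int rather than a one-element tuple. The variable names are illustrative.

```python
with_comma = (16,)    # a one-element tuple
without_comma = (16)  # just the integer 16 in parentheses

print(type(with_comma))
print(type(without_comma))
```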

As of version 2.1.2, apply() does not have the na_action argument.
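Because apply() has no na_action, one way to get comparable behavior is to handle missing values inside the function itself. This is a minimal sketch, and hex_or_nan is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

s_nan = pd.Series([1, float('nan'), 100])

def hex_or_nan(x):
    # Hypothetical helper: return missing values unchanged,
    # convert everything else to a hexadecimal string.
    if pd.isna(x):
        return x
    return hex(int(x))

result = s_nan.apply(hex_or_nan)
print(result)
```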

Apply functions to values in DataFrame: map(), applymap()

To apply a function to each value in a DataFrame (element-wise), use the map() or applymap() methods.

As of version 2.1.0, applymap() has been renamed to map() and marked as deprecated.

What’s new in 2.1.0 (Aug 30, 2023) — pandas 2.1.3 documentation
pandas.DataFrame.map — pandas 2.1.3 documentation
pandas.DataFrame.applymap — pandas 2.1.3 documentation

As of version 2.1.2, applymap() is still usable but issues a FutureWarning.

    df = pd.DataFrame([[1, 10, 100], [2, 20, 200]])
    print(df)
    #    0   1    2
    # 0  1  10  100
    # 1  2  20  200

    print(df.map(hex))
    #      0     1     2
    # 0  0x1   0xa  0x64
    # 1  0x2  0x14  0xc8

    print(df.applymap(hex))
    #      0     1     2
    # 0  0x1   0xa  0x64
    # 1  0x2  0x14  0xc8
    #
    # /var/folders/rf/b7l8_vgj5mdgvghn_326rn_c0000gn/T/ipykernel_36685/2076800564.py:1: FutureWarning: DataFrame.applymap has been deprecated. Use DataFrame.map instead.

source: pandas_dataframe_map_applymap.py

The following example uses map(), but applymap() has the same usage and functionality. In versions before 2.1.0, use applymap().
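If the same code has to run on pandas versions both before and after 2.1.0, one possible approach is to choose the method at runtime. This is only a sketch, not something the pandas documentation prescribes, and the variable name elementwise is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 10, 100], [2, 20, 200]])

# Prefer DataFrame.map() (pandas 2.1.0 and later);
# fall back to applymap() on older versions.
elementwise = df.map if hasattr(df, 'map') else df.applymap

result = elementwise(hex)
print(result)
```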

As with map() of Series, the na_action argument can be specified for map() of DataFrame. By default, missing values (NaN) are passed to the function, but if na_action is set to 'ignore', NaN is not passed to the function and the result remains as NaN.

    df_nan = pd.DataFrame([[1, float('nan'), 100], [2, 20, 200]])
    print(df_nan)
    #    0     1    2
    # 0  1   NaN  100
    # 1  2  20.0  200

    # print(df_nan.map(lambda x: hex(int(x))))
    # ValueError: cannot convert float NaN to integer

    print(df_nan.map(lambda x: hex(int(x)), na_action='ignore'))
    #      0     1     2
    # 0  0x1   NaN  0x64
    # 1  0x2  0x14  0xc8

source: pandas_dataframe_map_applymap.py

Unlike map() of Series, map() of DataFrame passes the specified keyword argument to the function.

    df = pd.DataFrame([['1', 'A', 'F'], ['11', 'AA', 'FF']])
    print(df)
    #     0   1   2
    # 0   1   A   F
    # 1  11  AA  FF

    print(df.map(int, base=16))
    #     0    1    2
    # 0   1   10   15
    # 1  17  170  255

source: pandas_dataframe_map_applymap.py

As of version 2.1.2, map() of DataFrame does not have the args argument, which means you cannot specify positional arguments.

Apply functions to rows and columns in DataFrame: apply()

To apply a function to rows or columns in a DataFrame, use the apply() method.

pandas.DataFrame.apply — pandas 2.1.3 documentation

To apply multiple operations at once with the agg() method, see the following article.

pandas: Aggregate data with agg(), aggregate()

Basic usage

Specify the function you want to apply as the first argument.

Note that the built-in sum() function is used for explanation purposes, but if you need to calculate a sum, it is better to use the sum() method mentioned later.

    df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])
    print(df)
    #     A   B   C
    # X  10  20  30
    # Y  40  50  60

    print(df.apply(sum))
    # A    50
    # B    70
    # C    90
    # dtype: int64

source: pandas_dataframe_apply.py

By default, each column is passed to the function as a Series. If the function cannot accept a Series as an argument, an error will occur.

    print(df.apply(lambda x: type(x)))
    # A    <class 'pandas.core.series.Series'>
    # B    <class 'pandas.core.series.Series'>
    # C    <class 'pandas.core.series.Series'>
    # dtype: object

    # print(hex(df['A']))
    # TypeError: 'Series' object cannot be interpreted as an integer

    # print(df.apply(hex))
    # TypeError: 'Series' object cannot be interpreted as an integer

source: pandas_dataframe_apply.py

Specify rows or columns: axis

By default, the function is applied to each column. However, setting the axis argument to 1 or 'columns' applies it to each row.

    df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])
    print(df)
    #     A   B   C
    # X  10  20  30
    # Y  40  50  60

    print(df.apply(sum, axis=1))
    # X     60
    # Y    150
    # dtype: int64

source: pandas_dataframe_apply.py

Specify arguments for the function: Keyword arguments, args

Any keyword arguments specified in apply() are passed to the function being applied. You can also specify positional arguments using the args argument.

    df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])
    print(df)
    #     A   B   C
    # X  10  20  30
    # Y  40  50  60

    def my_func(x, y, z):
        return sum(x) + y + z * 2

    print(df.apply(my_func, y=100, z=1000))
    # A    2150
    # B    2170
    # C    2190
    # dtype: int64

    print(df.apply(my_func, args=(100, 1000)))
    # A    2150
    # B    2170
    # C    2190
    # dtype: int64

source: pandas_dataframe_apply.py

Pass as ndarray instead of Series: raw

By default, each row or column is passed as a Series. If you set the raw argument to True, they are passed as NumPy arrays (ndarray).

    df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])
    print(df)
    #     A   B   C
    # X  10  20  30
    # Y  40  50  60

    print(df.apply(lambda x: type(x), raw=True))
    # A    <class 'numpy.ndarray'>
    # B    <class 'numpy.ndarray'>
    # C    <class 'numpy.ndarray'>
    # dtype: object

source: pandas_dataframe_apply.py

If there's no need for a Series, using raw=True is faster since the conversion process is omitted. However, if the function requires Series methods or attributes, setting raw=True will raise an error.
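As an illustration of a function that does not need a Series, the sketch below uses only methods that exist on both Series and ndarray, so it gives the same values with or without raw=True. The range calculation (maximum minus minimum) is an example chosen here, not taken from the original article.

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30], [40, 50, 60]],
                  index=['X', 'Y'], columns=['A', 'B', 'C'])

# max() and min() exist on both Series and ndarray, so raw=True is safe here.
as_series = df.apply(lambda x: x.max() - x.min())
as_ndarray = df.apply(lambda x: x.max() - x.min(), raw=True)

print(as_ndarray)
print((as_series == as_ndarray).all())
```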

    print(df.apply(lambda x: x.name * 3))
    # A    AAA
    # B    BBB
    # C    CCC
    # dtype: object

    # print(df.apply(lambda x: x.name * 3, raw=True))
    # AttributeError: 'numpy.ndarray' object has no attribute 'name'

source: pandas_dataframe_apply.py

Apply functions to specific rows or columns

To apply a function to a specific row or column, extract the row or column as a Series and use the map() or apply() methods of Series.

pandas: Select rows/columns by index (numbers and names)

    df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])
    print(df)
    #     A   B   C
    # X  10  20  30
    # Y  40  50  60

    print(df['A'].map(lambda x: x**2))
    # X     100
    # Y    1600
    # Name: A, dtype: int64

    print(df.loc['Y'].map(hex))
    # A    0x28
    # B    0x32
    # C    0x3c
    # Name: Y, dtype: object

source: pandas_dataframe_apply.py

You can add them as new rows or columns. If the same row or column names are specified, they will be overwritten.

pandas: Add rows/columns to DataFrame with assign(), insert()

    df['A'] = df['A'].map(lambda x: x**2)
    df.loc['Y_hex'] = df.loc['Y'].map(hex)
    print(df)
    #            A     B     C
    # X        100    20    30
    # Y       1600    50    60
    # Y_hex  0x640  0x32  0x3c

source: pandas_dataframe_apply.py

Use methods of DataFrame and Series, and arithmetic Operators

In pandas, common operations are provided as methods for DataFrame and Series, so there's no need to use map() or apply().

    df = pd.DataFrame([[1, -2, 3], [-4, 5, -6]], index=['X', 'Y'], columns=['A', 'B', 'C'])
    print(df)
    #    A  B  C
    # X  1 -2  3
    # Y -4  5 -6

    print(df.abs())
    #    A  B  C
    # X  1  2  3
    # Y  4  5  6

    print(df.sum())
    # A   -3
    # B    3
    # C   -3
    # dtype: int64

    print(df.sum(axis=1))
    # X    2
    # Y   -5
    # dtype: int64

source: pandas_numpy_function.py

For a list of available methods, refer to the official documentation.

DataFrame - Computations / descriptive stats — pandas 2.1.3 documentation
Series - Computations / descriptive stats — pandas 2.1.3 documentation

You can also process DataFrame and Series directly using arithmetic operators.

    print(df * 10)
    #     A   B   C
    # X  10 -20  30
    # Y -40  50 -60

    print(df['A'].abs() + df['B'] * 100)
    # X   -199
    # Y    504
    # dtype: int64

source: pandas_numpy_function.py

Methods for string manipulation are also available through the str accessor of Series.

pandas: Handle strings (replace, strip, case conversion, etc.)

    df = pd.DataFrame([['a', 'ab', 'abc'], ['x', 'xy', 'xyz']], index=['X', 'Y'], columns=['A', 'B', 'C'])
    print(df)
    #    A   B    C
    # X  a  ab  abc
    # Y  x  xy  xyz

    print(df['A'] + '-' + df['B'].str.upper() + '-' + df['C'].str.title())
    # X    a-AB-Abc
    # Y    x-XY-Xyz
    # dtype: object

source: pandas_numpy_function.py

Use NumPy functions

You can process DataFrame and Series by passing them to NumPy functions.

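For example, pandas does not provide a method for rounding down to the nearest integer (floor), but you can use np.floor() instead. Note that np.floor() rounds toward negative infinity, so -0.1 becomes -1.0; this differs from truncation toward zero, which is what np.trunc() does. For DataFrame, a DataFrame is returned; for Series, a Series is returned.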

NumPy: Round up/down array elements (np.floor, np.trunc, np.ceil)

    df = pd.DataFrame([[0.1, 0.5, 0.9], [-0.1, -0.5, -0.9]], index=['X', 'Y'], columns=['A', 'B', 'C'])
    print(df)
    #      A    B    C
    # X  0.1  0.5  0.9
    # Y -0.1 -0.5 -0.9

    print(np.floor(df))
    #      A    B    C
    # X  0.0  0.0  0.0
    # Y -1.0 -1.0 -1.0

    print(type(np.floor(df)))
    # <class 'pandas.core.frame.DataFrame'>

    print(np.floor(df['A']))
    # X    0.0
    # Y   -1.0
    # Name: A, dtype: float64

    print(type(np.floor(df['A'])))
    # <class 'pandas.core.series.Series'>

source: pandas_numpy_function.py

It is also possible to specify the axis argument in the NumPy function.

    print(np.sum(df, axis=0))
    # A    0.0
    # B    0.0
    # C    0.0
    # dtype: float64

    print(np.sum(df, axis=1))
    # X    1.5
    # Y   -1.5
    # dtype: float64

    print(type(np.sum(df, axis=0)))
    # <class 'pandas.core.series.Series'>

source: pandas_numpy_function.py

Speed comparison

Compare the processing speeds of the map() and apply() methods of DataFrame with other dedicated methods and NumPy functions.

Consider a DataFrame with 100 rows and 100 columns.

    df = pd.DataFrame(np.arange(-5000, 5000).reshape(100, 100))
    print(df.shape)
    # (100, 100)

source: pandas_map_apply_timeit.py

Note that the following examples use the %%timeit magic command in Jupyter Notebook. They won't work if executed as a Python script.

Measure execution time with timeit in Python
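Outside Jupyter Notebook, a comparable measurement can be made with the timeit module from the standard library. The following is a rough sketch: the repetition count of 100 and the variable names are arbitrary choices, and the absolute numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(-5000, 5000).reshape(100, 100))

n = 100  # arbitrary number of repetitions

# timeit.timeit() returns the total time in seconds for n calls,
# so dividing by n gives the average time per call.
t_apply = timeit.timeit(lambda: df.apply(sum), number=n) / n
t_sum = timeit.timeit(lambda: df.sum(), number=n) / n

print(f'df.apply(sum): {t_apply:.2e} s per loop')
print(f'df.sum():      {t_sum:.2e} s per loop')
```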

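The results for using the built-in abs() function with map(), compared to using the abs() method of DataFrame and the np.abs() function, are as follows. In this measurement, map() takes about 2 ms, while abs() and np.abs() take only a few microseconds, so map() is several hundred times slower.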

    %%timeit
    df.map(abs)
    # 2.07 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    %%timeit
    df.abs()
    # 5.06 µs ± 55 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

    %%timeit
    np.abs(df)
    # 7.81 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

source: pandas_map_apply_timeit.py

The results for using the built-in sum() function with apply(), compared to using the sum() method of DataFrame and the np.sum() function, are as follows. In this measurement, apply() is roughly 25 times slower than the sum() method or np.sum(). Setting raw=True roughly halves the time, but it is still more than 10 times slower than sum() of DataFrame or np.sum().

    %%timeit
    df.apply(sum)
    # 932 µs ± 95.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

    %%timeit
    df.apply(sum, raw=True)
    # 427 µs ± 4.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

    %%timeit
    df.sum()
    # 35 µs ± 140 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

    %%timeit
    np.sum(df, axis=0)
    # 37.3 µs ± 66.9 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

source: pandas_map_apply_timeit.py

The map() and apply() methods should be used primarily for complex operations that cannot be achieved with other methods or NumPy functions. If possible, it is better to use other methods or NumPy functions.
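As one illustration of this advice, a row-wise conditional that might first be written with apply() can often be expressed with a NumPy function instead. The sketch below takes the larger of two columns in each row; the column names and data are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 3], 'B': [4, 2, 6]})

# Row-wise version with apply(): the lambda is called once per row.
via_apply = df.apply(lambda row: row['A'] if row['A'] > row['B'] else row['B'], axis=1)

# Vectorized version: np.where() evaluates the condition on whole columns at once.
via_numpy = pd.Series(np.where(df['A'] > df['B'], df['A'], df['B']), index=df.index)

print(via_numpy.tolist())
print((via_apply == via_numpy).all())
```

Both versions produce the same values, but the np.where() version avoids calling a Python function for every row.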


